Translational Bioinformatics
MSNGO: multi-species protein function annotation based on 3D protein structure and network propagation
Wang, Beibei; Cui, Boyue; Chen, Shiqu; Wang, Xuan; Wang, Yadong; Li, Junyi
Motivation: In recent years, protein function prediction has moved past the bottleneck of sequence-only features, with high-accuracy protein structures predicted by AlphaFold2 substantially improving prediction accuracy. While single-species protein function prediction methods have achieved remarkable success, multi-species methods still rely only on PPI networks and sequence features, and providing effective cross-species label propagation for species with sparse protein annotations remains challenging. To address this problem, we propose MSNGO, a model that integrates structural features with network propagation. Our validation shows that structural features significantly improve the accuracy of multi-species protein function prediction.
Results: We employ graph representation learning to extract amino acid representations from protein structure contact maps and train a structural model with a graph convolution pooling module to derive protein-level structural features. After incorporating sequence features from ESM-2, we apply a network propagation algorithm to aggregate information and update node representations within a heterogeneous network. The results demonstrate that MSNGO outperforms previous multi-species protein function prediction methods that rely only on sequence features and PPI networks.
Availability: https://github.com/blingbell/MSNGO.
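As a rough illustration of the structural branch described above, a residue contact map can be encoded with graph convolutions and pooled into a single protein-level vector. The sketch below is not the MSNGO implementation; the 8 Å contact threshold, layer sizes, and mean-pooling readout are assumptions.

```python
# Minimal sketch (not the MSNGO code): graph convolution over a residue
# contact map, followed by pooling to one protein-level embedding.
# Contact threshold, layer sizes, and mean-pool readout are assumptions.
import torch
import torch.nn as nn

class ContactMapEncoder(nn.Module):
    def __init__(self, in_dim=1280, hidden_dim=256, out_dim=128):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, out_dim)

    def gcn_layer(self, x, adj, lin):
        # Symmetrically normalized adjacency: D^-1/2 (A + I) D^-1/2
        adj = adj + torch.eye(adj.size(0), device=adj.device)
        deg_inv_sqrt = adj.sum(dim=1).clamp(min=1).pow(-0.5)
        adj = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]
        return torch.relu(lin(adj @ x))

    def forward(self, residue_feats, ca_distances, threshold=8.0):
        # Residues whose C-alpha atoms are closer than `threshold`
        # Angstroms are treated as contacts.
        adj = (ca_distances < threshold).float()
        h = self.gcn_layer(residue_feats, adj, self.lin1)
        h = self.gcn_layer(h, adj, self.lin2)
        return h.mean(dim=0)  # protein-level structural feature

# Toy usage: 300 residues with 1280-d per-residue features (ESM-2-sized)
feats = torch.randn(300, 1280)
coords = torch.randn(300, 3)
protein_vec = ContactMapEncoder()(feats, torch.cdist(coords, coords))
```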
Learning Complete Protein Representation by Dynamically Coupling of Sequence and Structure
Learning effective representations is imperative for comprehending proteins and deciphering their biological functions. Recent strides in language models and graph neural networks have empowered protein models to harness primary or tertiary structure information for representation learning. Nevertheless, the absence of practical methodologies to appropriately model intricate inter-dependencies between protein sequences and structures has resulted in embeddings that exhibit low performance on tasks such as protein function prediction. In this study, we introduce CoupleNet, a novel framework designed to interlink protein sequences and structures to derive informative protein representations.
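The abstract does not detail the coupling mechanism, but as one illustrative possibility (an assumption, not CoupleNet's actual design), per-residue sequence embeddings and structure-derived embeddings could be fused with bidirectional cross-attention:

```python
# Illustrative sketch only: coupling per-residue sequence embeddings with
# structure-derived embeddings via cross-attention. CoupleNet's actual
# mechanism is not specified in the abstract; dimensions are made up.
import torch
import torch.nn as nn

class SequenceStructureCoupler(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.seq_to_struct = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.struct_to_seq = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_seq = nn.LayerNorm(dim)
        self.norm_struct = nn.LayerNorm(dim)

    def forward(self, seq_emb, struct_emb):
        # seq_emb, struct_emb: (batch, residues, dim)
        s2t, _ = self.seq_to_struct(seq_emb, struct_emb, struct_emb)
        t2s, _ = self.struct_to_seq(struct_emb, seq_emb, seq_emb)
        fused = self.norm_seq(seq_emb + s2t) + self.norm_struct(struct_emb + t2s)
        return fused.mean(dim=1)  # one vector per protein

coupler = SequenceStructureCoupler()
protein_repr = coupler(torch.randn(2, 300, 256), torch.randn(2, 300, 256))
```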
Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA
Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt tokenization methods designed for natural language, which are ill-suited to DNA given its distinct characteristics. In addition, the optimal way to tokenize DNA remains largely under-explored and may not be intuitively obvious to humans even once discovered. To address these challenges, we introduce MxDNA, a novel framework in which the model autonomously learns an effective DNA tokenization strategy through gradient descent. MxDNA employs a sparse Mixture of Convolution Experts coupled with a deformable convolution to model the tokenization process, explicitly accounting for the discontinuous, overlapping, and ambiguous nature of meaningful genomic segments. On the Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA outperforms existing methods with less pretraining data and time, highlighting its effectiveness. Finally, we show that MxDNA learns a tokenization strategy distinct from those of previous methods and captures genomic functionality at the token level during self-supervised pretraining. MxDNA offers a new perspective on DNA tokenization, with potential applications across genomic domains.
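As a rough sketch of the central idea, a sparse mixture of convolution experts can route each position of a nucleotide embedding sequence to one of several convolutional "experts" with different receptive fields. The code below is illustrative only: the kernel sizes, top-1 routing, and omission of the deformable-convolution step are simplifying assumptions, not MxDNA's actual configuration.

```python
# Minimal sketch (not the MxDNA implementation): a sparse mixture of
# convolution experts over nucleotide embeddings. A router picks one expert
# per position; different kernel sizes mimic variable-length "tokens".
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvExpertMixture(nn.Module):
    def __init__(self, dim=128, kernel_sizes=(3, 6, 9)):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )
        self.router = nn.Linear(dim, len(kernel_sizes))

    def forward(self, x):
        # x: (batch, length, dim) nucleotide embeddings
        gates = F.softmax(self.router(x), dim=-1)   # (B, L, num_experts)
        top1 = gates.argmax(dim=-1)                 # hard per-position choice
        x_ch = x.transpose(1, 2)                    # (B, dim, L)
        outs = torch.stack(
            [e(x_ch)[..., : x.size(1)].transpose(1, 2) for e in self.experts],
            dim=-1,
        )                                           # (B, L, dim, num_experts)
        mask = F.one_hot(top1, outs.size(-1)).unsqueeze(2).float()
        # (a real implementation would need straight-through routing or an
        # auxiliary loss so the router itself receives gradients)
        return (outs * mask).sum(dim=-1)            # learned "tokens"

tokens = ConvExpertMixture()(torch.randn(4, 512, 128))
```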
Gene42: Long-Range Genomic Foundation Model With Dense Attention
Vishniakov, Kirill; Amor, Boulbaba Ben; Tekin, Engin; ElNaker, Nancy A.; Viswanathan, Karthik; Medvedev, Aleksandr; Singh, Aahan; Nadeem, Maryam; Sayeed, Mohammad Amaan; Kanithi, Praveenkumar; Magalhaes, Tiago; Vassilieva, Natalia; Mahapatra, Dwarikanath; Pimentel, Marco; Khan, Shadab
We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such extensive long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.
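The described setup (a LLaMA-style decoder-only model over single nucleotides, first pretrained at 4,096 bp and then extended to longer contexts) could be approximated as follows with Hugging Face transformers. The hidden size, depth, vocabulary, and RoPE-theta increase used here are illustrative assumptions, not Gene42's published hyperparameters.

```python
# Rough sketch only: a LLaMA-style decoder-only causal LM at
# single-nucleotide resolution. All hyperparameters below are
# illustrative assumptions, not Gene42's published configuration.
from transformers import LlamaConfig, LlamaForCausalLM

# Single-nucleotide vocabulary: A, C, G, T, N plus special tokens.
vocab = ["<pad>", "<bos>", "<eos>", "A", "C", "G", "T", "N"]

config = LlamaConfig(
    vocab_size=len(vocab),
    hidden_size=1024,
    intermediate_size=4096,
    num_hidden_layers=16,
    num_attention_heads=16,
    max_position_embeddings=4096,   # phase 1: fixed-length 4,096 bp sequences
)
model = LlamaForCausalLM(config)

# Phase 2 (continued pretraining): extend the context window. One common
# recipe is to raise max_position_embeddings and the RoPE base frequency,
# then keep training on longer sequences. Rotary embeddings have no learned
# position weights, so the phase-1 parameters carry over directly.
long_cfg = config.to_dict()
long_cfg.update(max_position_embeddings=192_000, rope_theta=1_000_000.0)
model_long = LlamaForCausalLM(LlamaConfig(**long_cfg))
model_long.load_state_dict(model.state_dict(), strict=False)
```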